A-rsibal_PeerCommentary_Amalik03_03

Some of my best friends are Zombies…


Reflection

I enjoyed most of this homework, but some of it was a struggle. I really enjoyed trying to figure out how to make my own functions despite how frustrating it was. I struggled with question 6 because I misread the question and was trying to plot at 100 X 30 data points into a vector of vectors but realized I only needed the means of each random sample. I also struggled with finding a way to add to vectors. I used the internet and found the function “append” to make changes to a vector. Finally I needed to refresh my memory on how to analyze qq plots because I didn’t know how to use the graph. I reread Module 8 to grab a better understanding of quantile-quantile plots and its uses.I didn’t use the full capabilities of ggplot2 because I wanted to make sure I did the homework correctly. I plan on adding color and prettiness for the final submission.

1. Calculate the population mean and standard deviation for each quantitative random variable (height, weight, age, number of zombies killed, and years of education).

Ritika’s comments: I also used the curl function! I really liked how you organised your code here, specifically since you used “mean.height…”, it made it very easy to read and follow! great job!

library(curl)
## Using libcurl 7.88.1 with LibreSSL/3.3.6
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::parse_date() masks curl::parse_date()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
f <-curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/zombies.csv")
d <-read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d) #make sure it's set up right
##   id first_name last_name gender   height   weight zombies_killed
## 1  1      Sarah    Little Female 62.88951 132.0872              2
## 2  2       Mark    Duncan   Male 67.80277 146.3753              5
## 3  3    Brandon     Perez   Male 72.12908 152.9370              1
## 4  4      Roger   Coleman   Male 66.78484 129.7418              5
## 5  5      Tammy    Powell Female 64.71832 132.4265              4
## 6  6    Anthony     Green   Male 71.24326 152.5246              1
##   years_of_education                           major      age
## 1                  1                medicine/nursing 17.64275
## 2                  3 criminal justice administration 22.58951
## 3                  1                       education 21.91276
## 4                  6                  energy studies 18.19058
## 5                  3                       logistics 21.10399
## 6                  4                  energy studies 21.48355
(mean.height <- mean(d$height))
## [1] 67.6301
(mean.weight <- mean(d$weight))
## [1] 143.9075
(mean.age <- mean(d$weight))
## [1] 143.9075
(mean.kills <- mean(d$zombies_killed))
## [1] 2.992
(mean.edu <- mean(d$years_of_education))
## [1] 2.996
pop.sd <- function(x){ #create a function for population SD
  sqrt(
    sum(
      (x - mean(x))^2
    )
    / length(x)
  )
}
(height.sd <- pop.sd(d$height))
## [1] 4.30797
(weight.sd <- pop.sd(d$weight))
## [1] 18.39186
(age.sd <- pop.sd(d$age))
## [1] 2.963583
(kills.sd <- pop.sd(d$zombies_killed))
## [1] 1.747551
(edu.sd <- pop.sd(d$years_of_education))
## [1] 1.675704

2. Use {ggplot} to make boxplots of each of these variables by gender.

Ritika’s comments: Definitely not necessary here but as a suggestion you could also make the plotting into a function and pass in the x and y variables as input! But I think this section was very clear–great job!

heightplot <- ggplot(data = d, aes(x = gender, y = height)) + geom_boxplot() + xlab("Gender") + ylab("Height(inches)")
heightplot

weightplot <- ggplot(data = d, aes(x = gender, y = weight)) + geom_boxplot() + xlab("Gender") + ylab("Weight(pounds)")
weightplot

ageplot <- ggplot(data = d, aes(x = gender, y = age)) + geom_boxplot() + xlab("Gender") + ylab("Age(years)")
ageplot

killsplot <- ggplot(data = d, aes(x = gender, y = zombies_killed)) + geom_boxplot() + xlab("Gender") + ylab("Zombies Killed")
killsplot

eduplot <- ggplot(data = d, aes(x = gender, y = years_of_education)) + geom_boxplot() + xlab("Gender") + ylab("Years of Education")
eduplot

4. Using histograms and Q-Q plots, check whether the quantitative variables seem to be drawn from a normal distribution. Which seem to be and which do not (hint: not all are drawn from the normal distribution)? For those that are not normal, can you determine from which common distribution they are drawn?

Ritika’s comments: I liked your graph organisation here (both the hist and qq plots in same image) and that you used qqline in addition to qqnorm–i did the same thing! I actually said my graphs were left skewed but after reading your comments and doing a quick Google search I am now realizing that I am wrong! These are definitely right skewed graphs, so thank you!

par(mfrow = c(1,2))
hist(d$height) #Plots a basic histogram
qqnorm(d$height, frame = FALSE) #Plots a qqgraph of data points
qqline(d$height) #plots a qq line 


Yes, based on the histogram and the qqplot, the height of the survivors are drawn from a normal distribution curve.

par(mfrow = c(1,2))
hist(d$weight)
qqnorm(d$weight, frame = FALSE)
qqline(d$weight)


Yes, based on the histogram and the qqplot, the weight of the survivors are drawn from a normal distribution curve.

par(mfrow = c(1,2))
hist(d$age)
qqnorm(d$age, frame = FALSE)
qqline(d$age)


Yes, based on the histogram and the qqplot, the age of the survivors are drawn from a normal distribution curve.

par(mfrow = c(1,2))
hist(d$zombies_killed)
qqnorm(d$zombies_killed, frame = FALSE)
qqline(d$zombies_killed)


No, based on the graphs, the data does not stem from a normal distributions. This seems like the zombies killed by survivors data is drawn from a right-skewed distribution.

par(mfrow = c(1,2))
hist(d$years_of_education)
qqnorm(d$years_of_education, frame = FALSE)
qqline(d$years_of_education)


No, the data for survivors’ years of education is not drawn from a normal disrtribution. The data is looks heavily right skewed.

5. Now use the sample() function to sample ONE subset of 30 zombie survivors (without replacement) from this population and calculate the mean and sample standard deviation for each variable. Also estimate the standard error for each variable, and construct the 95% confidence interval for each mean. Note that for the variables that are not drawn from the normal distribution, you may need to base your estimate of the CIs on slightly different code than for the normal…

Ritika’s comments: cool use of sample_n–I did not know that existed and ended up downloading a separate package! One thing I think you might have missed here is that the first 3 graphs are normally distributed so we probably want to use qnorm instead of qt to calculate the CIs (i think this was in Mod 9), and then keep qt for the not normally distributed ones (last 2)

(samplesized <- sample_n(d, size = 30, replace = FALSE))
##     id first_name last_name gender   height   weight zombies_killed
## 1   36     Walter Carpenter   Male 70.90143 142.7320              3
## 2  965     Samuel  Hamilton   Male 74.16976 163.0445              3
## 3  839       Ryan    Barnes   Male 68.23992 128.3578              2
## 4  343    Beverly  Phillips Female 67.03964 138.8831              1
## 5  126       Eric     Young   Male 70.44371 171.2002              4
## 6  819     Harold Patterson   Male 70.75748 152.3995              3
## 7  776    Shirley     Jones Female 63.16838 133.4696              3
## 8  231   Kathleen    Wright Female 63.35430 118.5460              2
## 9  937    Raymond   Roberts   Male 65.37115 147.0078              2
## 10 510       Tina    Taylor Female 64.25954 124.1066              1
## 11 540      Norma  Anderson Female 62.43483 112.0983              3
## 12  13       Alan     Olson   Male 66.69701 133.6441              2
## 13 732     Steven  Reynolds   Male 68.95874 157.7417              1
## 14 436       Jack  Marshall   Male 67.57963 137.0341              4
## 15 162        Roy      Dunn   Male 75.27081 167.3627              2
## 16 872      Carol   Freeman Female 68.84493 138.4296              3
## 17 933        Amy    Willis Female 67.89424 124.2984              5
## 18 224      Tammy     Gomez Female 61.35044 120.3426              2
## 19 235      Betty    Howard Female 63.54701 145.4172              4
## 20 796      Betty    Tucker Female 63.82403 119.6649              2
## 21 472      Annie  Thompson Female 66.12076 134.5862              0
## 22 317 Jacqueline     Hicks Female 63.00016 131.8478              3
## 23  65       Gary    Morris   Male 72.98792 154.5452              4
## 24 432     Daniel  Campbell   Male 73.13875 183.5933              1
## 25 261      Tammy      Hart Female 57.63606 115.9745              4
## 26 756     Amanda    Rogers Female 60.74299 124.3453              2
## 27 455      Keith     Burns   Male 71.72562 157.6048              3
## 28 809        Roy     Mason   Male 69.61105 144.9560              2
## 29  92       Alan     Ramos   Male 65.04941 149.8804              4
## 30 368       Jane     Woods Female 69.90024 155.6157              1
##    years_of_education                           major      age
## 1                   2              physical education 23.98011
## 2                   0           agricultural sciences 23.24972
## 3                   1              physical education 19.92937
## 4                   2                medicine/nursing 20.93191
## 5                   5          mechanical engineering 20.42065
## 6                   0              physical education 21.43846
## 7                   5                          botany 17.85399
## 8                   5         business administration 21.93743
## 9                   5               culinart services 18.55335
## 10                  2                applied sciences 22.34368
## 11                  2                       education 18.06575
## 12                  1                       economics 18.59929
## 13                  3                medicine/nursing 21.93762
## 14                  2           environmental science 19.74716
## 15                  1                         biology 23.28328
## 16                  7 criminal justice administration 24.00616
## 17                  4              physical education 22.84620
## 18                  3           agricultural sciences 18.31888
## 19                  4                       economics 16.53939
## 20                  2                       logistics 21.72521
## 21                  2               culinart services 20.51588
## 22                  5                applied sciences 16.87172
## 23                  0                       education 19.81533
## 24                  5           agricultural sciences 22.57573
## 25                  1                       education 12.29306
## 26                  2                   city planning 16.33405
## 27                  1                         biology 19.30345
## 28                  3           environmental science 21.90370
## 29                  1               culinart services 15.18804
## 30                  2                    architecture 22.19631
(mean.height.s <- mean(samplesized$height))
## [1] 67.134
(mean.weight.s <- mean(samplesized$weight))
## [1] 140.9577
(mean.age.s <- mean(samplesized$age))
## [1] 20.09016
(mean.kills.s <- mean(samplesized$zombies_killed))
## [1] 2.533333
(mean.edu.s <- mean(samplesized$years_of_education))
## [1] 2.6
(sd.height.s <- pop.sd(samplesized$height))
## [1] 4.250069
(sd.weight.s <- pop.sd(samplesized$weight))
## [1] 17.56421
(sd.age.s <- pop.sd(samplesized$age))
## [1] 2.763434
(sd.kills.s <- pop.sd(samplesized$zombies_killed))
## [1] 1.175679
(sd.edu.s <- pop.sd(samplesized$years_of_education))
## [1] 1.8
standarderror <- function(sd) {
  sd/sqrt(30)
}

(standarderror(sd.height.s))
## [1] 0.7759529
(standarderror(sd.weight.s))
## [1] 3.206772
(standarderror(sd.age.s))
## [1] 0.5045317
(standarderror(sd.kills.s))
## [1] 0.2146487
(standarderror(sd.edu.s))
## [1] 0.3286335
xbar = NULL
sd= NULL
bounds <- function(xbar , sd){
  t.score <- qt(p=0.05/2, df = 29,, lower.tail = FALSE)
  lower.bound <- xbar - (t.score * sd/sqrt(30))
  upper.bound <- xbar + (t.score * sd/sqrt(30))
  print(c(lower.bound, upper.bound))
}
  
bounds(mean.height.s, sd.height.s)
## [1] 65.547 68.721
bounds(mean.weight.s, sd.weight.s)
## [1] 134.3991 147.5163
bounds(mean.age.s, sd.age.s)
## [1] 19.05828 21.12205
bounds(mean.kills.s, sd.kills.s)
## [1] 2.094327 2.972339
bounds(mean.edu.s, sd.edu.s)
## [1] 1.927869 3.272131
#Used this website to create bounds function: https://bookdown.org/logan_kelly/r_practice/p09.html

Now draw 99 more random samples of 30 zombie apocalypse survivors, and calculate the mean for each variable for each of these samples. Together with the first sample you drew, you now have a set of 100 means for each variable (each based on 30 observations), which constitutes a sampling distribution for each variable. What are the means and standard deviations of this distribution of means for each variable? How do the standard deviations of means compare to the standard errors estimated in [5]? What do these sampling distributions look like (a graph might help here)? Are they normally distributed? What about for those variables that you concluded were not originally drawn from a normal distribution?

Ritika’s comments: great job here! I think we ended up doing the sampling in almost the exact same way. I’ve been coding for a long time now and I have to say your code is very well thought out and looks like someone who is an experienced coder. I like that you reuse your functions here and that you also added the QQ plots–I only plotted the histograms so I will be definitely adding that. The only suggestion here is that I dont think you need to have par(mfrow = c(1,2)) every time. I think once you set it in a session it should hold its value for the rest of the file (until you change it)

meanheights <- c(mean.height.s)
meanweights <- c(mean.weight.s)
meanages <- c(mean.age.s)
meankills <- c(mean.kills.s)
meanedus <- c(mean.edu.s)



for (x in 1:99) { #create a loop that adds the mean vectors of the quantitative date to a vector of means
  rsample <- sample_n(d, size = 30, replace = FALSE)
  meanheights <- append(meanheights, mean(rsample$height))
  meanweights <- append(meanweights, mean(rsample$weight))
   meanages <- append(meanages, mean(rsample$age))
   meankills <- append(meankills, mean(rsample$zombies_killed))
   meanedus <- append(meanedus, mean(rsample$years_of_education))
}

meanheights
##   [1] 67.13400 68.27821 67.67389 68.23831 68.05631 68.13375 69.28150 66.39103
##   [9] 67.75279 67.68368 66.65270 68.63706 66.04277 67.46130 66.78045 67.92521
##  [17] 66.70691 68.54358 68.53892 67.47763 67.95184 69.35587 67.78857 66.22889
##  [25] 67.61154 67.43544 67.42596 68.24939 66.44887 66.63638 67.90845 67.94210
##  [33] 66.77796 67.98100 67.31928 66.20743 69.79256 68.22401 67.91228 68.64239
##  [41] 68.95279 67.22306 69.24500 67.69711 68.07894 66.10750 67.36105 67.41173
##  [49] 66.36788 67.68149 67.52382 68.72029 68.28310 68.87110 67.43820 67.51291
##  [57] 66.87908 67.02280 67.84456 66.40778 68.59410 68.02518 68.28912 67.82374
##  [65] 65.85376 66.84396 67.19964 67.50179 69.17147 68.49748 67.78277 67.73995
##  [73] 67.36755 67.64773 69.12397 66.35496 67.77447 67.83693 67.88098 66.81460
##  [81] 66.95212 67.62374 68.48978 67.73330 68.03024 67.65567 67.73450 65.93704
##  [89] 68.84943 67.87285 68.07648 66.97512 68.22720 66.75427 67.52277 68.52634
##  [97] 67.38308 67.35737 67.84445 66.98143
pop.sd(meanheights) #Use pop.sd because this set of 30 is our sample
## [1] 0.8216888
meanweights
##   [1] 140.9577 148.8191 144.6930 146.9882 144.7800 144.0543 151.3254 138.0279
##   [9] 145.9604 146.4166 140.3435 148.4224 139.5249 142.5249 140.6115 140.7086
##  [17] 138.4230 145.1214 146.0666 143.1946 145.0178 150.7471 140.6832 138.2050
##  [25] 144.4396 138.2144 143.6027 147.4786 138.1750 140.8663 146.6401 146.0239
##  [33] 144.4406 145.7809 143.7467 138.2007 154.1304 145.3050 144.1590 149.0686
##  [41] 148.2953 144.0047 149.9251 144.1349 148.1647 137.1428 146.1101 143.4973
##  [49] 141.3555 145.0650 140.8638 149.2913 147.1552 146.7595 143.3074 147.4660
##  [57] 141.0459 142.0823 146.5660 138.7607 149.5523 142.3424 146.1611 143.4733
##  [65] 139.3400 143.3765 143.3385 147.6238 149.4568 147.3603 148.4950 144.6271
##  [73] 142.0884 143.5700 149.0014 138.4415 147.7062 144.6965 143.4083 140.3828
##  [81] 138.7293 147.7390 148.8483 145.2991 146.9725 144.9653 143.1601 139.9778
##  [89] 147.0154 146.5757 143.7436 141.3445 145.9101 138.2358 143.1136 147.2740
##  [97] 145.0123 140.2812 144.2747 141.2813
pop.sd(meanweights)
## [1] 3.554466
meanages
##   [1] 20.09016 20.35339 19.92714 19.94374 20.83312 20.36676 20.66315 20.04407
##   [9] 19.93331 19.36285 19.31068 20.22151 19.24674 20.28719 19.29950 21.24731
##  [17] 19.94526 20.27997 20.66201 19.71245 20.00827 20.87650 20.29223 19.60265
##  [25] 20.78528 20.14804 20.26358 20.30540 19.47730 19.86868 20.28794 20.07563
##  [33] 19.45442 20.21944 20.40191 19.60320 20.91108 20.44651 20.61264 20.26201
##  [41] 21.11380 19.64440 20.73244 20.68375 19.77418 19.50241 19.18708 20.26785
##  [49] 18.89746 19.78174 19.83161 20.27003 20.56288 20.58653 20.13285 19.82258
##  [57] 19.32622 19.19253 19.92185 19.61020 19.52189 20.91088 20.97505 20.93122
##  [65] 19.08859 18.92208 20.20053 19.19859 21.38954 20.41330 19.55905 20.36850
##  [73] 20.18380 20.10413 20.26552 19.66762 20.46000 19.82398 20.98094 19.98325
##  [81] 20.00672 19.83595 20.06204 20.14549 19.72102 20.36216 20.36792 18.69207
##  [89] 21.10751 20.35233 21.23646 19.16595 21.05086 20.13401 20.12513 21.20104
##  [97] 20.15978 20.70634 19.80555 19.56325
pop.sd(meanages)
## [1] 0.5821113
meankills
##   [1] 2.533333 2.866667 2.533333 2.700000 2.900000 2.966667 3.366667 3.200000
##   [9] 2.800000 2.666667 2.900000 2.766667 2.933333 3.033333 3.266667 3.266667
##  [17] 2.966667 2.966667 2.666667 3.200000 3.233333 2.966667 2.900000 3.000000
##  [25] 2.866667 3.300000 3.266667 2.766667 3.133333 3.000000 3.133333 3.233333
##  [33] 3.000000 2.566667 3.033333 3.233333 2.966667 3.033333 3.000000 2.900000
##  [41] 2.900000 3.433333 2.600000 3.133333 3.233333 3.100000 3.366667 3.566667
##  [49] 2.900000 2.766667 2.566667 2.666667 2.933333 3.233333 3.600000 2.700000
##  [57] 3.066667 3.300000 2.900000 3.366667 2.833333 2.966667 2.833333 2.966667
##  [65] 2.633333 3.100000 2.500000 2.833333 2.833333 2.500000 2.500000 2.933333
##  [73] 3.400000 2.733333 2.966667 2.633333 2.900000 3.266667 3.666667 2.933333
##  [81] 3.133333 3.433333 3.100000 3.100000 2.733333 2.733333 2.733333 3.066667
##  [89] 3.066667 3.000000 2.833333 2.733333 3.100000 2.633333 2.900000 2.666667
##  [97] 2.866667 2.900000 2.933333 3.033333
pop.sd(meankills)
## [1] 0.2561467
meanedus
##   [1] 2.600000 2.766667 2.766667 3.133333 3.133333 3.133333 2.533333 3.033333
##   [9] 2.900000 2.833333 3.066667 3.133333 3.233333 3.800000 3.233333 2.600000
##  [17] 3.133333 3.066667 3.033333 3.466667 2.766667 3.333333 3.033333 2.733333
##  [25] 2.733333 3.066667 3.500000 3.100000 3.266667 2.800000 2.800000 3.000000
##  [33] 3.033333 2.733333 3.233333 2.966667 3.200000 3.433333 3.366667 2.766667
##  [41] 3.100000 2.666667 3.100000 2.566667 2.200000 3.166667 3.166667 2.766667
##  [49] 3.066667 2.633333 3.266667 2.933333 3.033333 3.200000 3.233333 3.133333
##  [57] 2.633333 3.000000 2.433333 3.100000 3.266667 3.266667 3.400000 3.200000
##  [65] 2.833333 2.966667 2.966667 2.900000 3.133333 3.533333 2.633333 2.866667
##  [73] 2.333333 3.033333 3.266667 2.800000 2.600000 2.733333 3.100000 3.033333
##  [81] 3.733333 2.833333 2.933333 2.500000 2.900000 2.766667 2.333333 3.366667
##  [89] 2.433333 3.233333 3.033333 2.766667 3.166667 2.966667 3.400000 3.466667
##  [97] 2.366667 3.133333 2.933333 3.166667


The standard deviations are smaller from this population are smaller than that of the one from Question 5.

par(mfrow = c(1,2))
hist(meanheights) #makes a basic histogram
qqnorm(meanheights, frame = FALSE) #plots a qq graph 
qqline(meanheights) #plots qq line on the graph for analysis

par(mfrow = c(1,2))
hist(meanweights) 
qqnorm(meanweights, frame = FALSE)
qqline(meanweights)

par(mfrow = c(1,2))
hist(meanages)
qqnorm(meanages, frame = FALSE)
qqline(meanages)

par(mfrow = c(1,2))
hist(meankills)
qqnorm(meankills, frame = FALSE)
qqline(meankills) 

par(mfrow = c(1,2))
hist(meanedus)
qqnorm(meanedus, frame = FALSE)
qqline(meanedus)


Based on th histograms and qqplots, they all seem more likely to be drawn from a normal distribution. However, mean weight is likely not drawn from normal distributed data.